Unit testing invalid Unicode sequences with xUnit and InlineData
I was unit testing some invalid user inputs in C# with xUnit v3 and was getting weird results when using [InlineData]. It turns out this was by xUnit’s design, and here is my workaround.
This is a short post for future me. I had a simple unit test for my string sanitizer to test some invalid Unicode strings, like this:
[Theory]
[InlineData("\uDC00", "_")] // Isolated low surrogate
[InlineData("\uD800\uD800", "__")] // Two high surrogates
[InlineData("\uD800\uDC00", "\uD800\uDC00")] // Valid pair
public void TestInvalidUnicodeSequences(string input, string expected)
{
    var result = PathUtils.SanitizeDirectoryName(input);
    Assert.Equal(expected, result);
}
And it was failing…
The thing that caught my eye in the failure output was the number of characters. Why were there three and six undisplayable characters when the input had been one and two characters, respectively? After debugging every mistake I could have made, I found this GitHub issue about xUnit’s serialization of Unicode strings. It was closed because this behaviour is by design. The issue linked to an older discussion about string values in MemberData. Essentially, xUnit serializes strings to UTF-8, which breaks invalid Unicode sequences.
“Non-Unicode legal strings will get “mangled” when converted to Unicode during the serialization process because we convert to UTF-8. A single D800 is, by itself, not legal Unicode.”
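To see why UTF-8 cannot carry a lone surrogate, here is a minimal sketch using plain .NET with no xUnit involved. The helper name RoundTripUtf8 is my own, and this does not reproduce xUnit's actual serializer, so the exact mangled output may differ from what the test runner shows. It only illustrates that an isolated surrogate does not survive a trip through Encoding.UTF8, while a valid pair does.

```csharp
using System;
using System.Text;

// Hypothetical helper: encode to UTF-8 bytes and decode back again.
// Encoding.UTF8 substitutes U+FFFD for code units it cannot encode.
static string RoundTripUtf8(string s) =>
    Encoding.UTF8.GetString(Encoding.UTF8.GetBytes(s));

string lone = "\uDC00";        // isolated low surrogate, not legal Unicode
string pair = "\uD800\uDC00";  // valid surrogate pair (U+10000)

Console.WriteLine(RoundTripUtf8(lone) == lone);      // False: the surrogate is lost
Console.WriteLine(RoundTripUtf8(lone) == "\uFFFD");  // True: replaced by U+FFFD
Console.WriteLine(RoundTripUtf8(pair) == pair);      // True: valid pairs survive
```

Once the original code unit has been replaced, there is no way to recover it on the other side, which is why the sanitizer never sees the input the test intended.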
Solution with char[]
Luckily, there is an easy workaround: just use char[] for tests that need to handle invalid Unicode sequences, and manually convert the array back to string with new string(input). Here is the updated test:
[Theory]
[InlineData(new[] { '\uDC00' }, "_")] // Isolated low surrogate
[InlineData(new[] { '\uD800', '\uD800' }, "__")] // Two high surrogates
[InlineData(new[] { '\uD800', '\uDC00' }, "\uD800\uDC00")] // Valid pair
public void TestInvalidUnicodeSequences(char[] input, string expected)
{
    var result = PathUtils.SanitizeDirectoryName(new string(input));
    Assert.Equal(expected, result);
}
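The reason the new string(input) conversion is safe is that a char[] holds raw UTF-16 code units, and building a string from it copies them verbatim without any validation or re-encoding. A small sketch in plain .NET, with variable names chosen just for illustration:

```csharp
using System;

string lone = "\uDC00";               // isolated low surrogate
char[] units = lone.ToCharArray();    // copies the UTF-16 code units as-is
string rebuilt = new string(units);   // no encoding step, so nothing is replaced

Console.WriteLine(rebuilt == lone);                  // True
Console.WriteLine(char.IsLowSurrogate(rebuilt[0]));  // True: still the original code unit
```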
It would be nice if we could use a simpler way of converting the strings into char[], such as "\uDC00".ToCharArray(), but unfortunately attribute arguments like those in InlineData must be constant expressions (or arrays of them, which is why new[] { ... } is allowed), and a method call is not one. At least this method achieves what we need.